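Contents

I. Introduction to the iris dataset
II. Visualizing one-dimensional data
III. Visualizing two-dimensional data
IV. Visualizing multi-dimensional data
V. References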


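I. Introduction to the iris dataset

The iris dataset has 150 observations and 5 variables: sepal length, sepal width, petal length, petal width, and species. The species variable takes three values, setosa, virginica, and versicolor, which are three different varieties of iris with 50 observations each. See the table below for details.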

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Jupyter magic; only valid inside a notebook
    %matplotlib inline
    sns.set(style="white", color_codes=True)

    # load the iris dataset
    from sklearn.datasets import load_iris
    iris_data = load_iris()
    iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
    iris = pd.merge(iris, pd.DataFrame(iris_data['target'], columns=['species']), left_index=True, right_index=True)
    # replace the integer targets 0/1/2 with the species names
    labels = dict(zip([0,1,2], iris_data['target_names']))
    iris['species'] = iris['species'].apply(lambda x: labels[x])
    iris.head()

[Figure: iris data.png, the first rows of the iris DataFrame]
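The counts quoted in the introduction (150 observations, 5 variables, 50 rows per species) can be checked directly. The following is a minimal sketch, not the article's own code; it attaches the species names by NumPy indexing, which is an alternative to the merge/apply approach used above.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
# Index the array of names with the integer targets to get one name per row
iris['species'] = iris_data['target_names'][iris_data['target']]

# Overall shape: rows are observations, columns are the 4 features plus species
print(iris.shape)

# Number of observations per species
counts = iris['species'].value_counts()
print(counts)
```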




II. Visualizing one-dimensional data

A boxplot shows the relationship between a single feature and species. The three varieties of iris are already fairly well separated on petal length alone. In particular, setosa's petal length differs so much from the other two that you can pick it out at a glance.

    # look at an individual feature in Seaborn through a boxplot
    sns.boxplot(x='species', y='petal length (cm)', data=iris)

[Figure: box plot]

kdeplot (kernel density plot)

    # kdeplot looking at univariate relations
    # creates and visualizes a kernel density estimate of the underlying feature
    # (`height` replaces the older `size` argument of FacetGrid)
    sns.FacetGrid(iris, hue='species', height=6) \
       .map(sns.kdeplot, 'petal length (cm)') \
       .add_legend()

[Figure: kdeplot]

violinplot (violin plot): combines features of the box plot and the kernel density plot. It shows and compares the distribution of a continuous variable across one or more categorical variables, which makes it an effective way to look at several distributions at once.

    # A violin plot combines the benefits of the boxplot and kdeplot
    # Denser regions of the data are fatter, and sparser regions thinner, in a violin plot
    sns.violinplot(x='species', y='petal length (cm)', data=iris)

[Figure: violin plot]

III. Visualizing two-dimensional data

Scatter plot: FacetGrid colors the points by species, which makes relationships in the data easier to spot. Two features are used here. setosa is as easy to recognize as ever, while virginica and versicolor are still hard to tell apart.

    # use seaborn's FacetGrid to color the scatterplot by species
    sns.FacetGrid(iris, hue="species", height=5) \
       .map(plt.scatter, "sepal length (cm)", "sepal width (cm)") \
       .add_legend()

[Figure: scatter plot by species]

pairplot: shows the pairwise relationships between features, which is just great!

    # pairplot shows the bivariate relation between each pair of features
    # From the pairplot, we'll see that the Iris-setosa species is separated from the other two across all feature combinations
    # The diagonal elements in a pairplot show the histogram by default
    # We can update these elements to show other things, such as a kde
    sns.pairplot(iris, hue='species', height=3, diag_kind='kde')

[Figure: pairplot]

IV. Visualizing multi-dimensional data


1. Andrews curves
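An Andrews curve turns a sample x = (x1, x2, x3, ...) into the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -pi and pi. The sketch below evaluates that series by hand for a single iris row to make the idea concrete. The helper name `andrews_curve` is made up for this illustration and is not part of pandas.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one sample x at the angles t (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first attribute contributes the constant term x1 / sqrt(2)
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)  # odd positions use sine
        else:
            result += coeff * np.cos(harmonic * t)  # even positions use cosine
    return result

t = np.linspace(-np.pi, np.pi, 200)
sample = [5.1, 3.5, 1.4, 0.2]  # the four measurements of the first iris row
curve = andrews_curve(sample, t)
print(curve[:5])
```

Samples with similar measurements produce similar curves, which is why curves of the same species tend to bundle together in the plot.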


    # Andrews curves use the attributes of each sample as coefficients of a Fourier series and then plot the resulting curves
    pd.plotting.andrews_curves(iris, 'species')

[Figure: Andrews curves]

2. Parallel coordinates


    # Parallel coordinates plots each feature on a separate column and then draws lines connecting the features for each data sample
    pd.plotting.parallel_coordinates(iris, 'species')

[Figure: parallel coordinates]

3. RadViz (radial visualization)
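RadViz places one anchor per feature on a circle and positions each sample at the average of those anchors, weighted by the sample's scaled feature values. The following NumPy sketch works through that calculation by hand. It is written for illustration under the assumption of min-max scaling and evenly spaced anchors; the pandas implementation may differ in detail.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris()['data']

# Min-max scale every feature to [0, 1] so that each "spring" has a non-negative weight
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# One anchor per feature, evenly spaced on the unit circle
m = X.shape[1]
angles = 2 * np.pi * np.arange(m) / m
anchors = np.column_stack([np.cos(angles), np.sin(angles)])

# Each sample sits at the weighted average of the anchors
points = X_scaled @ anchors / X_scaled.sum(axis=1, keepdims=True)
print(points[:3])
```

Because every point is a weighted average of anchors on the unit circle, all samples fall inside the circle, and a sample is pulled toward the anchors of the features on which it scores relatively high.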


    # radviz puts each feature as a point on a 2D plane, and then simulates
    # having each sample attached to those points through a spring weighted by the relative value for that feature
    pd.plotting.radviz(iris, 'species')

[Figure: radviz]

4. Factor analysis (FactorAnalysis)



    from sklearn import decomposition

    # reduce the four features to two factors
    fa = decomposition.FactorAnalysis(n_components=2)
    X = fa.fit_transform(iris.iloc[:,:-1].values)

    pos = pd.DataFrame()
    pos['X'] = X[:, 0]
    pos['Y'] = X[:, 1]
    pos['species'] = iris['species']

    # draw one scatter layer per species on the same axes
    ax = pos[pos['species']=='virginica'].plot(kind='scatter', x='X', y='Y', color='blue', label='virginica')
    pos[pos['species']=='setosa'].plot(kind='scatter', x='X', y='Y', color='green', label='setosa', ax=ax)
    pos[pos['species']=='versicolor'].plot(kind='scatter', x='X', y='Y', color='red', label='versicolor', ax=ax)

[Figure: factor analysis]

5. Principal component analysis (PCA)


    from sklearn import decomposition

    # project the four features onto the first two principal components
    pca = decomposition.PCA(n_components=2)
    X = pca.fit_transform(iris.iloc[:,:-1].values)

    pos = pd.DataFrame()
    pos['X'] = X[:, 0]
    pos['Y'] = X[:, 1]
    pos['species'] = iris['species']

    # draw one scatter layer per species on the same axes
    ax = pos[pos['species']=='virginica'].plot(kind='scatter', x='X', y='Y', color='blue', label='virginica')
    pos[pos['species']=='setosa'].plot(kind='scatter', x='X', y='Y', color='green', label='setosa', ax=ax)
    pos[pos['species']=='versicolor'].plot(kind='scatter', x='X', y='Y', color='red', label='versicolor', ax=ax)

[Figure: PCA]



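    output: array([0.92461621, 0.05301557])

These are the shares of variance explained by the two retained principal components: the first explains 92.5% of the original variation and the second explains 5.3%. In other words, after reducing to two dimensions, 97.8% of the original information is still retained.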
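The article does not show the call that printed this array. It looks like the `explained_variance_ratio_` attribute of the fitted PCA object, and the sketch below is written under that assumption. The exact digits may differ slightly depending on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']
pca = PCA(n_components=2).fit(X)

# Share of the total variance captured by each retained component
ratio = pca.explained_variance_ratio_
print(ratio)

# Running total: the last entry is the share kept after reducing to two dimensions
cumulative = np.cumsum(ratio)
print(cumulative)
```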

6. Independent component analysis (ICA)


    from sklearn import decomposition

    # extract two independent components with FastICA
    fica = decomposition.FastICA(n_components=2)
    X = fica.fit_transform(iris.iloc[:,:-1].values)

    pos = pd.DataFrame()
    pos['X'] = X[:, 0]
    pos['Y'] = X[:, 1]
    pos['species'] = iris['species']

    # draw one scatter layer per species on the same axes
    ax = pos[pos['species']=='virginica'].plot(kind='scatter', x='X', y='Y', color='blue', label='virginica')
    pos[pos['species']=='setosa'].plot(kind='scatter', x='X', y='Y', color='green', label='setosa', ax=ax)
    pos[pos['species']=='versicolor'].plot(kind='scatter', x='X', y='Y', color='red', label='versicolor', ax=ax)

[Figure: ICA]

7. Multi-dimensional scaling (MDS)


    from sklearn import manifold
    from sklearn.metrics import euclidean_distances

    # pairwise Euclidean distances between samples (used as dissimilarities despite the variable name)
    similarities = euclidean_distances(iris.iloc[:,:-1].values)
    mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, dissimilarity="precomputed", n_jobs=1)
    X = mds.fit(similarities).embedding_

    pos = pd.DataFrame(X, columns=['X', 'Y'])
    pos['species'] = iris['species']

    # draw one scatter layer per species on the same axes
    ax = pos[pos['species']=='virginica'].plot(kind='scatter', x='X', y='Y', color='blue', label='virginica')
    pos[pos['species']=='setosa'].plot(kind='scatter', x='X', y='Y', color='green', label='setosa', ax=ax)
    pos[pos['species']=='versicolor'].plot(kind='scatter', x='X', y='Y', color='red', label='versicolor', ax=ax)

[Figure: MDS]

8. t-SNE (t-distributed Stochastic Neighbor Embedding)


    from sklearn.manifold import TSNE

    # embed the four features into two dimensions
    iris_embedded = TSNE(n_components=2).fit_transform(iris.iloc[:,:-1])

    pos = pd.DataFrame(iris_embedded, columns=['X','Y'])
    pos['species'] = iris['species']

    # draw one scatter layer per species on the same axes
    ax = pos[pos['species']=='virginica'].plot(kind='scatter', x='X', y='Y', color='blue', label='virginica')
    pos[pos['species']=='setosa'].plot(kind='scatter', x='X', y='Y', color='green', label='setosa', ax=ax)
    pos[pos['species']=='versicolor'].plot(kind='scatter', x='X', y='Y', color='red', label='versicolor', ax=ax)

[Figure: t-SNE]

Well, I think the t-SNE result is the cutest of them all ⁄(⁄ ⁄•⁄ω⁄•⁄ ⁄)⁄
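The same three-call plotting pattern is repeated for every embedding method above. One possible way to factor it out is a small helper like the one sketched here. The name `plot_embedding` and its signature are hypothetical and are not taken from the article or from any library. PCA is used only as an example input.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

def plot_embedding(X, species, colors=None):
    """Scatter a 2-D embedding X with one color per species (hypothetical helper)."""
    # Same color scheme as the hand-written plots in the article
    colors = colors or {'virginica': 'blue', 'setosa': 'green', 'versicolor': 'red'}
    pos = pd.DataFrame(X[:, :2], columns=['X', 'Y'])
    pos['species'] = list(species)
    fig, ax = plt.subplots()
    # One scatter layer per species, all drawn on the same axes
    for name, group in pos.groupby('species'):
        ax.scatter(group['X'], group['Y'], color=colors[name], label=name)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend()
    return ax

iris_data = load_iris()
species = iris_data['target_names'][iris_data['target']]
embedding = PCA(n_components=2).fit_transform(iris_data['data'])
ax = plot_embedding(embedding, species)
```

Any of the other two-column embeddings (factor analysis, ICA, MDS, t-SNE) could be passed in place of the PCA result.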

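V. References

Python Data Visualizations
Multi-dimensional data visualization (多维数据可视化)
More advanced than PCA dimensionality reduction: a practical guide to the t-SNE clustering algorithm in R/Python (比PCA降维更高级——(R/Python)t-SNE聚类算法实践指南)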




